This notebook was adapted from:
a library to analyze *mobility data*, suited for working with:
# import the library
import skmob
import warnings
import geopandas as gpd
import pandas as pd
from skmob.tessellation import tilers
from skmob.utils import plot
import matplotlib.pyplot as plt
from tqdm import tqdm
from stats_utils import *
warnings.filterwarnings('ignore')
tess_style = {'color':'gray', 'fillColor':'gray', 'opacity':0.2}
scikit-mobility provides two user-friendly data structures that extend the pandas DataFrame:
- TrajDataFrame - for spatio-temporal **trajectories**
- FlowDataFrame - for **fluxes** mapped into a tessellation

TrajDataFrame

Each row describes a trajectory's point and contains the following columns:
- lat - latitude of the point
- lng - longitude of the point
- datetime - date and time of the point

For multi-user data sets, there are two optional columns:
- uid - identifier of the user to whom the trajectory belongs
- tid - identifier for the trajectory

A TrajDataFrame can be created from:
- a list
- a DataFrame
- a file

# From a list
data_list = [[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
[1, 39.984198, 116.319322, '2008-10-23 13:53:06'],
[1, 39.984224, 116.319402, '2008-10-23 13:53:11'],
[1, 39.984211, 116.319389, '2008-10-23 13:53:16']]
data_list
[[1, 39.984094, 116.319236, '2008-10-23 13:53:05'], [1, 39.984198, 116.319322, '2008-10-23 13:53:06'], [1, 39.984224, 116.319402, '2008-10-23 13:53:11'], [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]
We must set the indexes of the mandatory columns using arguments latitude, longitude and datetime.
tdf = skmob.TrajDataFrame(data_list,
latitude=1, longitude=2,
datetime=3)
print(type(tdf))
tdf
<class 'skmob.core.trajectorydataframe.TrajDataFrame'>
| | 0 | lat | lng | datetime |
|---|---|---|---|---|
| 0 | 1 | 39.984094 | 116.319236 | 2008-10-23 13:53:05 |
| 1 | 1 | 39.984198 | 116.319322 | 2008-10-23 13:53:06 |
| 2 | 1 | 39.984224 | 116.319402 | 2008-10-23 13:53:11 |
| 3 | 1 | 39.984211 | 116.319389 | 2008-10-23 13:53:16 |
From a DataFrame

# import the pandas library
import pandas as pd
# build a dataframe from the 2D list
data_df = pd.DataFrame(data_list,
columns=['user', 'latitude', 'lng', 'hour'])
print(type(data_df)) # type of the structure
data_df.head() # head of the DataFrame
<class 'pandas.core.frame.DataFrame'>
| | user | latitude | lng | hour |
|---|---|---|---|---|
| 0 | 1 | 39.984094 | 116.319236 | 2008-10-23 13:53:05 |
| 1 | 1 | 39.984198 | 116.319322 | 2008-10-23 13:53:06 |
| 2 | 1 | 39.984224 | 116.319402 | 2008-10-23 13:53:11 |
| 3 | 1 | 39.984211 | 116.319389 | 2008-10-23 13:53:16 |
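Note that:
- the column names in data_df don't match the names required
- the mandatory columns are specified through the arguments latitude, longitude and datetime; longitude is omitted below because the column is already named lng, the name TrajDataFrame expects

# Create a TrajDataFrame from a DataFrame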
tdf = skmob.TrajDataFrame(data_df,
latitude='latitude',
datetime='hour',
user_id='user')
print(type(tdf))
tdf.head()
<class 'skmob.core.trajectorydataframe.TrajDataFrame'>
| | uid | lat | lng | datetime |
|---|---|---|---|---|
| 0 | 1 | 39.984094 | 116.319236 | 2008-10-23 13:53:05 |
| 1 | 1 | 39.984198 | 116.319322 | 2008-10-23 13:53:06 |
| 2 | 1 | 39.984224 | 116.319402 | 2008-10-23 13:53:11 |
| 3 | 1 | 39.984211 | 116.319389 | 2008-10-23 13:53:16 |
# create a TrajDataFrame from a dataset of trajectories
tdf = skmob.TrajDataFrame.from_file(
'files/geolife_sample.txt.gz', sep=',')
print(type(tdf))
<class 'skmob.core.trajectorydataframe.TrajDataFrame'>
# explore the TrajDataFrame
tdf.head(5)
| | lat | lng | datetime | uid |
|---|---|---|---|---|
| 0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
| 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
| 2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
| 3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
| 4 | 39.984217 | 116.319422 | 2008-10-23 05:53:21 | 1 |
Attributes of a TrajDataFrame

- crs: the coordinate reference system. Default: epsg:4326 (lat/long)
- parameters: a dictionary for storing any additional properties

tdf.crs
{'init': 'epsg:4326'}
tdf.parameters
{'from_file': 'files/geolife_sample.txt.gz'}
# add your own parameter
tdf.parameters['compress'] = {'thre': 10}
tdf.parameters
{'from_file': 'files/geolife_sample.txt.gz', 'compress': {'thre': 10}}
Columns of TrajDataFrame have specific types
# In the DataFrame
print(type(data_df))
data_df.dtypes
<class 'pandas.core.frame.DataFrame'>
user          int64
latitude    float64
lng         float64
hour         object
dtype: object
print(type(tdf)) # In the TrajDataFrame
tdf.dtypes
<class 'skmob.core.trajectorydataframe.TrajDataFrame'>
lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object
tdf.lat.head()
0    39.984094
1    39.984198
2    39.984224
3    39.984211
4    39.984217
Name: lat, dtype: float64
scikit-mobility provides dedicated functions to write a TrajDataFrame to a file and to read it back.

Write a TrajDataFrame to a file

The write function preserves:
- the parameters and crs attributes
- the dtype of columns with timestamps (time zone info is lost though)

skmob.write(tdf, './tdf.json')
tdf.parameters
{'from_file': 'files/geolife_sample.txt.gz', 'compress': {'thre': 10}}
Read a TrajDataFrame from a json file

# read the file written before
tdf2 = skmob.read('./tdf.json')
tdf2[:4]
| | lat | lng | datetime | uid |
|---|---|---|---|---|
| 0 | 39.984094 | 116.319236 | 2008-10-23 05:53:05 | 1 |
| 1 | 39.984198 | 116.319322 | 2008-10-23 05:53:06 | 1 |
| 2 | 39.984224 | 116.319402 | 2008-10-23 05:53:11 | 1 |
| 3 | 39.984211 | 116.319389 | 2008-10-23 05:53:16 | 1 |
The column dtypes and the parameters and crs attributes are preserved:
print(tdf2.dtypes)
tdf2.parameters
lat                float64
lng                float64
datetime    datetime64[ns]
uid                  int64
dtype: object
{'from_file': 'files/geolife_sample.txt.gz', 'compress': {'thre': 10}}
scikit-mobility relies on the folium library to plot trajectories, flows and tessellations:
tdf.plot_trajectory(zoom=12, weight=3, opacity=0.9,
tiles='Stamen Toner', start_end_markers=True)
FlowDataFrame

Each row describes a flow and contains the columns:
- origin: ID of the origin tile
- destination: ID of the destination tile
- flow: number of people travelling from origin to destination

Each FlowDataFrame is associated with a **tessellation**, i.e., a GeoDataFrame that contains two columns:
- tile_ID, identifier of a location
- geometry, geometric shape of the location

The FlowDataFrame can be created from:
- a DataFrame
- a text file

The method from_file creates a FlowDataFrame from a text file with the format:
origin, destination, flow, datetime (optional)

tessellation = gpd.GeoDataFrame.from_file(
"files/NY_counties_2011.geojson") # load a tessellation
# create a FlowDataFrame from a file and a tessellation
fdf = skmob.FlowDataFrame.from_file(
"files/NY_commuting_flows_2011.csv",
tessellation=tessellation, tile_id='tile_id', sep=",")
fdf.head()
| | flow | origin | destination |
|---|---|---|---|
| 0 | 121606 | 36001 | 36001 |
| 1 | 5 | 36001 | 36005 |
| 2 | 29 | 36001 | 36007 |
| 3 | 11 | 36001 | 36017 |
| 4 | 30 | 36001 | 36019 |
fdf.dtypes
flow            int64
origin         object
destination    object
dtype: object
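The dtypes above show that tile identifiers are stored as strings while flows are integers. The following is a minimal standard-library sketch of parsing a flow file in the `origin, destination, flow` format, using the first rows of the output above as sample data. The header row and column order of the sample text are assumptions for illustration; this is not how `from_file` is implemented.

```python
import csv
import io

# Sample text in the assumed "origin,destination,flow" layout (values taken
# from the first rows shown above; the header row is an assumption).
sample = """origin,destination,flow
36001,36001,121606
36001,36005,5
36001,36007,29
"""

# Keep tile IDs as strings and convert only the flow to an integer.
rows = [
    {"origin": r["origin"], "destination": r["destination"], "flow": int(r["flow"])}
    for r in csv.DictReader(io.StringIO(sample))
]

# Flows leaving tile '36001', excluding the self-flow (origin == destination).
outgoing = [r for r in rows if r["origin"] == "36001" and r["destination"] != "36001"]
total_out = sum(r["flow"] for r in outgoing)
print(total_out)
```

Because the identifiers are strings, filters on a FlowDataFrame compare against quoted values such as `'36061'` rather than integers.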
# The tessellation is an attribute of the FlowDataFrame
fdf.tessellation.head()
| | tile_ID | population | geometry |
|---|---|---|---|
| 0 | 36019 | 81716 | POLYGON ((-74.00667 44.88602, -74.02739 44.995... |
| 1 | 36101 | 99145 | POLYGON ((-77.09975 42.27421, -77.09966 42.272... |
| 2 | 36107 | 50872 | POLYGON ((-76.25015 42.29668, -76.24914 42.302... |
| 3 | 36059 | 1346176 | POLYGON ((-73.70766 40.72783, -73.70027 40.739... |
| 4 | 36011 | 79693 | POLYGON ((-76.27907 42.78587, -76.27535 42.780... |
fdf.plot_tessellation(popup_features=['tile_ID', 'population'])
fdf.plot_flows(flow_color='green')
map_f = fdf.plot_tessellation(style_func_args=tess_style)
fdf[fdf['origin'] == '36061'].plot_flows(map_f=map_f, flow_exp=0., flow_popup=True)
# load the dataset using pandas
df = pd.read_csv("files/loc-brightkite_totalCheckins.txt.gz", sep='\t', header=0, nrows=500000,
names=['user', 'check-in_time', "latitude", "longitude",
"location id"])
# convert the pandas DataFrame into an skmob TrajDataFrame
tdf = skmob.TrajDataFrame(df, latitude='latitude',
longitude='longitude', datetime='check-in_time', user_id='user')
print(tdf.shape)
tdf.head()
(500000, 5)
| | uid | datetime | lat | lng | location id |
|---|---|---|---|---|---|
| 0 | 0 | 2010-10-16 06:02:04+00:00 | 39.891383 | -105.070814 | 7a0f88982aa015062b95e3b4843f9ca2 |
| 1 | 0 | 2010-10-16 03:48:54+00:00 | 39.891077 | -105.068532 | dd7cd3d264c2d063832db506fba8bf79 |
| 2 | 0 | 2010-10-14 18:25:51+00:00 | 39.750469 | -104.999073 | 9848afcc62e500a01cf6fbf24b797732f8963683 |
| 3 | 0 | 2010-10-14 00:21:47+00:00 | 39.752713 | -104.996337 | 2ef143e12038c870038df53e0478cefc |
| 4 | 0 | 2010-10-13 23:31:51+00:00 | 39.752508 | -104.996637 | 424eb3dd143292f9e013efa00486c907 |
print("number of users:\t", len(tdf.uid.unique()))
print("number of records:\t", len(tdf))
number of users: 1231
number of records: 500000
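The radius of gyration $r_g$ is the characteristic distance traveled by an individual:

$$r_g = \sqrt{\frac{1}{N} \sum_{i=1}^N (\mathbf{r}_i - \mathbf{r}_{cm})^2}$$

where $\mathbf{r}_i$ are the $N$ positions recorded for the individual and $\mathbf{r}_{cm}$ is the position vector of the center of mass of the set of locations visited by the individual.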
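To make the radius-of-gyration formula concrete, here is a minimal pure-Python sketch. It assumes the center of mass is the arithmetic mean of the latitudes and longitudes and that distances are great-circle (haversine) distances in kilometers on a sphere of radius 6371 km. The helper names `haversine_km` and `radius_of_gyration_km` are hypothetical and the sketch is not claimed to match skmob's implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def radius_of_gyration_km(points):
    """Root mean square distance of the points from their center of mass."""
    n = len(points)
    # Center of mass approximated by the mean latitude and mean longitude.
    center = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    return math.sqrt(sum(haversine_km(p, center) ** 2 for p in points) / n)


# The four sample points used earlier in this notebook.
points = [(39.984094, 116.319236), (39.984198, 116.319322),
          (39.984224, 116.319402), (39.984211, 116.319389)]
rg = radius_of_gyration_km(points)
print(rg)
```

Averaging coordinates is only a reasonable approximation when the points do not straddle the antimeridian or the poles.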
from skmob.measures.individual import radius_of_gyration
rg_df = radius_of_gyration(tdf)
100%|██████████████████████████████████████| 1231/1231 [00:02<00:00, 551.87it/s]
# let's plot the distribution of the radius of gyration
fig = plt.figure(figsize=(4, 4))
rg_list = list(rg_df.radius_of_gyration[rg_df.radius_of_gyration >= 1.0])
x, y = zip(*lbpdf(1.5, rg_list))
plt.plot(x, y, marker='o')
plt.xlabel('$r_g$ [km]', fontsize=20);plt.ylabel('P($r_g$)', fontsize=20)
plt.grid(alpha=0.2);
plt.loglog();
plt.show()
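The helper `lbpdf` comes from `stats_utils`, which is not shown in this notebook. Judging from how it is called, it plausibly returns a log-binned probability density as `(x, y)` pairs, with the first argument as the ratio between consecutive bin edges. The function below, `log_binned_pdf`, is a hypothetical stand-in written under that assumption, not the notebook's implementation.

```python
import math


def log_binned_pdf(base, values):
    """Return (bin_center, density) pairs for bins with edges base**k, base**(k+1).

    Non-positive values are skipped because they have no logarithm.
    """
    values = [v for v in values if v > 0]
    n = len(values)
    counts = {}
    for v in values:
        k = math.floor(math.log(v, base))  # index of the bin containing v
        counts[k] = counts.get(k, 0) + 1
    pdf = []
    for k in sorted(counts):
        lo, hi = base ** k, base ** (k + 1)
        center = math.sqrt(lo * hi)            # geometric mean of the bin edges
        density = counts[k] / (n * (hi - lo))  # normalize by sample size and bin width
        pdf.append((center, density))
    return pdf


pdf = log_binned_pdf(2, [1.5, 3, 3.5, 5, 6, 7])
for center, density in pdf:
    print(center, density)
```

Dividing each count by the bin width matters here: log-spaced bins grow with the value, so raw counts would overstate the tail of a heavy-tailed distribution.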
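A jump length is the distance between two consecutive locations visited by an individual. Given a TrajDataFrame, skmob computes the lengths for each individual independently, using the jump_lengths function.

from skmob.measures.individual import jump_lengths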
jl_df = jump_lengths(tdf) # disable progress bar with show_progress=False
jl_df.head(4)
100%|██████████████████████████████████████| 1231/1231 [00:02<00:00, 458.10it/s]
| | uid | jump_lengths |
|---|---|---|
| 0 | 0 | [19.64046732887831, 0.0, 0.0, 1.74343110103816... |
| 1 | 1 | [6.505330424378811, 46.754366003759536, 53.928... |
| 2 | 2 | [0.0, 0.0, 0.0, 0.0, 3.641009719594163, 0.0, 5... |
| 3 | 3 | [3861.270630079885, 4.061631313492122, 5.91632... |
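As an illustration of what a list of jump lengths contains, the sketch below computes the distances between consecutive points of a single time-ordered trajectory. It assumes haversine distances in kilometers on a sphere of radius 6371 km; the names `haversine_km` and `jump_lengths_km` are hypothetical, and the sketch is not skmob's implementation.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def jump_lengths_km(trajectory):
    """Distances between consecutive (lat, lng) points of a time-ordered trajectory."""
    return [haversine_km(*a, *b) for a, b in zip(trajectory, trajectory[1:])]


# A trajectory with one repeated point: the second jump has length zero.
trajectory = [(0.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
jumps = jump_lengths_km(trajectory)
print(jumps)
```

A trajectory of n points yields n - 1 jumps, and repeated check-ins at the same place produce zeros like those visible in the table above.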
# merge=True puts the distances of all individuals into a single list
jl_list = jump_lengths(tdf, merge=True)
type(jl_list)
100%|██████████████████████████████████████| 1231/1231 [00:02<00:00, 464.12it/s]
list
# let's plot the distribution of jump lengths
fig = plt.figure(figsize=(4, 4))
d_list = [dist for dist in jl_list[:10000] if dist >= 1]
x, y = zip(*lbpdf(1.5, d_list))
plt.plot(x, y, marker='o')
plt.xlabel('jump length [km]', fontsize=15);plt.ylabel('P(jump length)', fontsize=15)
plt.grid(alpha=0.2);plt.loglog();plt.show()
The maximum_distance function computes the maximum distance traveled by each individual.

from skmob.measures.individual import max_distance_from_home, distance_straight_line, maximum_distance
md_df = maximum_distance(tdf)
md_df.head()
100%|██████████████████████████████████████| 1231/1231 [00:02<00:00, 457.83it/s]
| | uid | maximum_distance |
|---|---|---|
| 0 | 0 | 11294.436420 |
| 1 | 1 | 12804.895064 |
| 2 | 2 | 11286.745660 |
| 3 | 3 | 12803.259219 |
| 4 | 4 | 15511.927586 |
# let's plot the distribution
fig, ax1 = plt.subplots(1, 1)
ax1.hist(md_df.maximum_distance, bins=50, rwidth=0.8)
ax1.set_xlabel('max', fontsize=15)
Text(0.5, 0, 'max')
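The individual mobility network of a user is a network where:
- nodes are the locations visited by the individual
- directed edges are the trips the individual made between two locations, weighted by the number of trips (n_trips)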
from skmob.measures.individual import individual_mobility_network
imn_df = individual_mobility_network(tdf)
imn_df.head()
100%|██████████████████████████████████████| 1231/1231 [00:02<00:00, 419.20it/s]
| | uid | lat_origin | lng_origin | lat_dest | lng_dest | n_trips |
|---|---|---|---|---|---|---|
| 0 | 0 | 37.774929 | -122.419415 | 37.600747 | -122.382376 | 1 |
| 1 | 0 | 37.600747 | -122.382376 | 37.615223 | -122.389979 | 1 |
| 2 | 0 | 37.600747 | -122.382376 | 37.580304 | -122.343679 | 1 |
| 3 | 0 | 37.615223 | -122.389979 | 39.878664 | -104.682105 | 1 |
| 4 | 0 | 37.615223 | -122.389979 | 37.580304 | -122.343679 | 1 |
an_imn = imn_df[imn_df.uid == 2]
an_imn.sort_values(by='n_trips', ascending=False).head(5)
| | uid | lat_origin | lng_origin | lat_dest | lng_dest | n_trips |
|---|---|---|---|---|---|---|
| 1686 | 2 | 39.758302 | -104.936129 | 39.802002 | -105.095430 | 69 |
| 1452 | 2 | 39.802002 | -105.095430 | 39.758302 | -104.936129 | 59 |
| 1493 | 2 | 39.739154 | -104.984703 | 39.802002 | -105.095430 | 52 |
| 1446 | 2 | 39.802002 | -105.095430 | 39.739154 | -104.984703 | 51 |
| 1535 | 2 | 39.739154 | -104.984703 | 39.818040 | -105.081949 | 23 |
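The rows above can be read as weighted directed edges (origin, destination, n_trips). As a rough sketch of how such edge weights could be derived from one individual's time-ordered sequence of visited locations, the snippet below counts consecutive pairs. The function `mobility_network` and its `self_loops` flag are hypothetical choices for this illustration, not skmob's code.

```python
from collections import Counter


def mobility_network(visits, self_loops=False):
    """Count trips between consecutive locations.

    visits: time-ordered list of hashable locations, e.g. (lat, lng) tuples.
    Returns a Counter mapping (origin, destination) -> number of trips.
    """
    return Counter(
        (origin, dest)
        for origin, dest in zip(visits, visits[1:])
        if self_loops or origin != dest
    )


# Place labels stand in for (lat, lng) pairs to keep the example readable.
visits = ["home", "work", "home", "work", "gym", "gym"]
net = mobility_network(visits)
for (origin, dest), n_trips in net.most_common():
    print(origin, dest, n_trips)
```

Sorting the edges by weight, as done above with `sort_values(by='n_trips', ascending=False)`, surfaces the individual's most frequent trips.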
visits_per_location computes the number of visits to each location made by the whole population of individuals.
from skmob.measures.collective import visits_per_location
vpl_df = visits_per_location(tdf)
vpl_df.head()
| | lat | lng | n_visits |
|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 10378 |
| 1 | 39.739154 | -104.984703 | 9958 |
| 2 | 40.014986 | -105.270546 | 4548 |
| 3 | 37.774929 | -122.419415 | 3615 |
| 4 | 40.714269 | -74.005973 | 2881 |
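Conceptually, a table like the one above is a count of records per distinct coordinate pair, sorted by decreasing count. A minimal sketch with the standard library follows; the function name `count_visits` is hypothetical, and treating each exact (lat, lng) pair as one location is an assumption of this example.

```python
from collections import Counter


def count_visits(points):
    """Return ((lat, lng), n_visits) pairs sorted by decreasing number of visits."""
    return Counter(points).most_common()


# Check-ins pooled over all individuals; coordinates taken from the table above.
points = [
    (39.739154, -104.984703),
    (40.014986, -105.270546),
    (39.739154, -104.984703),
    (37.774929, -122.419415),
    (39.739154, -104.984703),
    (40.014986, -105.270546),
]
visits = count_visits(points)
for (lat, lng), n_visits in visits:
    print(lat, lng, n_visits)
```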
fig = plt.figure(figsize=(4, 4))
x, y = zip(*lbpdf(1.5, list(vpl_df.n_visits)))
plt.plot(x, y, marker='o')
plt.xlabel('visits per location', fontsize=15)
plt.loglog()
plt.show()